Adaptive Markov Chain Monte Carlo for Auxiliary Variable 
Method and Its Application to Parallel Tempering 



Takamitsu Araki Kazushi Ikeda 

Graduate School of Information Science, Nara Institute of Science and Technology 
E-mail: takamitsu-a@is.naist.jp, kazushi@is.naist.jp 



(N 

o 

(N 

m 



O 

U 

C/2 



> 

in 

cn 
o 



X 



Abstract — Auxiliary variable methods such as 
the Parallel Tempering and the cluster Monte Carlo 
methods generate samples that follow a target distri- 
bution by using proposal and auxiliary distributions. 
In sampling from complex distributions, these algo- 
rithms are highly more efficient than the standard 
Markov chain Monte Carlo methods. However, their 
performance strongly depends on their parameters and 
determining the parameters is critical. In this paper, 
we proposed an algorithm for adapting the parame- 
ters during drawing samples and proved the conver- 
gence theorem of the adaptive algorithm. We applied 
our algorithm to the Parallel Tempering. That is, we 
developed adaptive Parallel Tempering that tunes the 
parameters on the fly. We confirmed the effectiveness 
of our algorithm through the validation of the adap- 
tive Parallel Tempering, comparing samples from the 
target distribution by the adaptive Parallel Tempering 
and samples by conventional algorithms. 

Keywords — Adaptive Markov Chain Monte Carlo, 
Auxiliary Variable Method, Parallel Tempering, Conver- 
gence 

1 Introduction 

Markov chain Monte Carlo (MCMC) methods have 
been an important algorithm in various scientific fields 
(Liu 2001, Robert and Casella 2004). MCMC meth- 
ods can generate samples that follow a target distri- 
bution by using a simple proposal distribution. How- 
ever, in sampling from complex distribution such as 
multi-modal, the standard MCMC methods produce 
samples theoretically converge the target distribution 
but practically do not. The produced samples can be 
trapped in a local mode for a extremely long period. 

To cope with this localization problem, the parallel 
tempering (PT) a.k.a. exchange Monte Carlo method 
was proposed (Geyer 1991; Hukushima and Nemoto 
1996). The PT algorithm introduces auxiliary dis- 
tributions with a parameter called the temperature, 
generates multiple MCMC samples from target and 
auxiliary distributions in parallel, and exchanges the 
positions of two samples. An auxiliary distribution is 
tempered when the temperature is high and one with a 
low temperature is similar to the target distribution. 
This "tempering" implementation and the exchange 
process help samples escape from a local mode. 

Auxiliary distributions are also used in other sev- 
eral algorithms. One example is the Gibbs variable se- 
lection in Bayesian variable selection, where auxiliary 
distributions approximate the marginal posterior dis- 
tribution of coefficient parameters for sampling from 



the joint posterior (Dellaportas et al. 2002). Another 
is the cluster Monte Carlo methods, efficiently produce 
samples by block-wise updates based on auxiliary dis- 
tributions (Swendsen and Wang 1987; Higdon 1998). 
These algorithms are referred to as auxiliary variable 
methods (AVMs) in this paper. 

The performance of an AVM depends on both the 
proposal distribution and the auxiliary distributions. 
Hence the parameters of the distributions have to be 
chosen so that the Markov chains of the AVM mix as 
faster as possible. They have been tuned by rough 
methods or trial-and-error in pilot runs so far because 
their relationship to the mixing speed has not been 
clear. 

For standard MCMC methods such as Metropolis- 
Hastings algorithm (Hastings 1970), Gilks et al. (1998) 
and Haario et al. (2001) proposed adaptive MCMC 
algorithms that tuned the parameters of a proposal 
distribution by using past samples during runs. Haario 
et al. (2001) also proved the convergence theorem of 
their algorithms, which was developed later (Andrieu 
and Moulines 2006; Roberts and Rosenthal 2007). 

In this paper, we proposed adaptive AVMs by 
extending the above adaptive algorithms to general 
AVMs, where the algorithms adapt the parameters of 
the proposal and the auxiliary distributions of AVMs 
during runs. We also proved the convergence theo- 
rem of our algorithm in a similar way to Roberts and 
Rosenthal (2007). 

We showed the effectiveness of adaptive AVMs by 
applying it to the PT algorithm and validating it nu- 
merically. That is, we developed an adaptive PT algo- 
rithm that tune the proposal parameters, which con- 
tain the number of temperatures, and temperatures 
while it runs, and showed the effectiveness of the algo- 
rithm via numerical experiments. We also proved the 
convergence of the adaptive PT algorithm by using the 
theorem of the general adaptive AVMs. 

The rest of the paper is organized as follows, the 
conventional PT algorithm is briefly shown in Section 
2 and an adaptive PT algorithm is proposed in Sec- 
tion 3. Experimental properties of our algorithm are 
examined in Section 4. In section 5, the proposed al- 
gorithm is extended to general AVMs and the conver- 
gence theorems of our algorithms are proved in Section 
6. Finally, we give a conclusion in Section 7. 

2 Parallel Tempering Algorithm 

The PT algorithm is a typical algorithm that uses 
auxiliary distributions, ir tl (dxi), I = 2,...,L, where 
1 = h > t 2 > ■ ■ ■ > t L > 0. The density of the 
Ith auxiliary distribution is parametrized by the in- 



verse temperature ti as ftt^x) oc 7r(x)*' or nt^x) cx 
Tv(x) tl p(x) 1 ~ tl , where 7r(x) is the density of the target 
distribution and p{x) is the density of a simple distri- 
bution that mix fast by using a conventional MCMC 
method. In other words, the inverse temperature ti 
tempers the multi-modality of the target distribution 
ir{dx) so that the auxiliary densities, 7r ti {xi), gradually 
connect the target density ir(x) to a simple density 
p{x) or the uniform distribution. 

The PT algorithm executes either of the parallel 
step and the exchange step at time n, with probability 
a r and 1 — a r , respectively. The parallel step generates 
the L samples, xj , I = 1, . . . , L, according to 7r t( (dxi) 
for each by using a standard MCMC method. Note 
that we employed the Metropolis algorithm with the 
proposal variance 7; in this paper. The exchange step 
randomly chooses a sample x\ n ^ from the L— 1 samples, 



» 



I — 1, . . . , L — 1, and exchange Xj for aij?^ with 



» 



probability 



min 1 



7r ti(x ( l n) )Tr tl+1 (x'i+i) / 



(1) 



The performance of the PT algorithm strongly de- 
pends on the inverse temperatures, more specifically, 
their intervals and their number. The interval of two 
adjacent inverse temperatures determines both the 
similarity of the two distributions and the acceptance 
probability of an exchange as seen in Eq. ([!}. The 
acceptance ratio for the exchanges, which is referred 
to as exchange ratio in this paper, should not take ex- 
treme value. For example, Liu (2001) said a preferable 
value is a half at any interval. To avoid extreme values 
and lead to homogeneous exchange ratios, Hukushima 
(1999) updated temperatures using a recursive formula 
through preliminary runs and Goswami and Liu (2007) 
tuned the intervals by iteratively estimating the ex- 
pected exchange probability through preliminary runs. 

Jasra (2007) treated the intervals as a sequence and 
experimentally compared three inverse-temperature 
sequences, equal space, logarithmic decay and power 
decay. The results showed the last was best. Nagata 
and Watanabe (2008) proved in the low temperature 
limit when the sequence of inverse temperatures is a 
geometric progression the exchange ratios are homo- 
geneous. However, the above methods only discussed 
the intervals and did not take into account the pro- 
posal distributions, on which the mixing of samples 
and the estimation of exchange ratio also depend. In 
our setting, for example, the Metropolis algorithm has 
a parameter to be determined, that is, the proposal 
variance 7/ . It is necessary to re-set the proposal vari- 
ance when the inverse temperatures are changed a lot, 
because the appropriate proposal variances obviously 
depend on the shape of auxiliary distributions. 

The more auxiliary distributions the PT algorithm 
has, the faster the samples are mixed because flat- 
ter auxiliary distributions are available but the more 
computational complexity is required. To solve the 
trade-off and determine an appropriate number of dis- 
tributions, Goswami and Liu (2007) proposed to select 



the maximum temperature using statistical tests. The 
tests should be done in an off-line manner, that is, they 
need preliminary experiments in advance. 

3 Adaptive PT Algorithm 

We propose an adaptive PT algorithm that adapts 
the inverse temperatures, the parameters of proposal 
distribution, and the minimum inverse temperature 
while the algorithm is running. The three adaptation 
algorithms are described below. 

The exchange ratio should take a moderate value. 
To converge the exchange ratio for xi-\ and x\ to a 
specific value, a e (0, 1), typically a half, the log in- 
verse temperature, Q = log(t;), is updated as 



^(n+1) ^_ ^.(n) 



where ERYj. , is a variable that takes one if the ex- 
change occurs between the samples, x["l and x ' 



a l n (ER^ 



1,1 



a 



(2) 



at 

time n, and zero otherwise. The learning coefficient, 
a l n , is a decreasing random variable with n that satis- 
fies linin^oo a n — almost sure. 

The proposal distribution of the Metropolis algo- 
rithm for a target and auxiliary distribution should 
have an appropriate variance, which is an average 
of the variances of all modes. To control the pro- 
posal distribution of the Metropolis algorithm for the 
distribution ir tl (dxi) on R p , the the variance 7; = 
(7;i, 7/p) € M p and the auxiliary mean parameter 
Mi = (/-*/! j •••) Hip) S K p are updated as 
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where x^ +1) is the jth element of x[ n+1) £ W. When 
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is updated to xj n+1 ^ by exchanging to x]'^ or 
XjV^, the /ij jumps to the exchanged value, that is, 



O) 



(n+l) 



>+i) 



fii ' +- x\ '. The learning coefficient, b n , is a 
decreasing function of n that satisfies linin^oo b n = 0. 

The auxiliary distribution with the minimum in- 
verse temperature should be so flat that Metropolis 
samples can frequently move from one mode to an- 
other while the total number of auxiliary distributions 
should be as small as possible. To determine an ap- 
propriate value for the minimum inverse temperature, 
the auxiliary distributions Tr tl (dxi) with I > I* are re- 
moved where I* is the smallest number that satisfies 



'{Xlj), 



(4) 
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where V ("J [xij ) is the sample variance of xij at time n. 
This check is done at time n = m, 2m, . . ., where m is 
a large number (e.g. 10 4 ). To improve the reliability, 
when inequality Q holds a few times d (e.g. 3) in 
succession the auxiliary distribution is determined to 
be enough flat. 

Inequality (j4|) shows the relationship between the 
sample variance and the proposal variance. Due to 



Eq. ([3]), the latter converges to the variance of local 
region and hence it is smaller than the sample vari- 
ance if Metropolis samples are localized in a mode. 
Otherwise, the auxiliary distribution is flat enough. 

A pseudo code of the adaptive PT algorithm is 
given in the following. 



Algorithm Adaptive PT algorithm 

Initialize a:. , Q > ^1°^ Mz^ anc ^ Q — ,1 = 
1,...,L. (£i =0, constant), 
for n = to N - 1 do 

u ~ t/[0, 1] (where t/[0, 1] is a uniform distribu- 
tion of the interval (0,1)). 
if u < a r then 
for I = 1 to L do 

(parallel sampling step) 

Generate x\ n+1 ^ via Metropolis Algorithm for 
7T.(n) (dxi), which has the proposal variances 

(n) 

7; ■ 

(proposal parameter learning step) 

Update ( 7i (n) , M | n) ) to ( 7i ( " +1) , tf +1) ) by the 
Eq. ©. 
end for 
else 

(exchange step) 

Randomly Choose a neighboring pair, x^ and 

(n) 

x l+11 and exchange them with the probability 
Eq. Q. 

(inverse temperature learning step) 

Update Ci+\ to Ci+i" 1 ' by Eq. ©. 
if the exchange is accepted, then 
4 n+1) ^4 n+1) ,forfc = /,/ + l. 
end if 
end if 

( minimum inverse temperature decision step) 
if (n mod m) == then 

for I = 1 to L do 

If Eq. g]) hold, then q «- cj + 1. 

end for 

L <- min{/|Q > d,J = l,...,L}, if {} ^ 0. 
end if 
end for 



The adaptive PT algorithm converges. The proof 
will be given later as a special case of adaptive MCMC 
algorithms for general auxiliary variable methods. 

4 Experiments 

To confirm the effectiveness of our algorithm, the 
following two computer experiments were carried out: 

1. A mixture of four normal distributions. 

2. The posterior of a mixture of six normal distri- 
butions. 

In each of the experiments, the burn-in period was 
a half of the total number of iterations and sample 
sets, which are used in an estimation and a scat- 
ter plot, are chosen from every 50 samples in post 



burn-in. The proposal distribution of the Metropo- 
lis algorithm was an independent normal distribu- 
tion. Other parameters were a — 0.5, a r — 0.5, 

a l n = 1/(1 + n/(20 + 100) * log(exp(-C i ( " ) ) + 1), 
b n = 1/(5 + O.ln), m = 10 4 and d = 3. L = 25 
and the intervals of inverse temperatures are equal, 
that is t(°) = (1,24/25, 23/25,..., 1/25), at the initial 
condition. Note that these values are invariant for the 
each above distribution, i.e., a tuning of these values 
was not necessary in these experiments. 

4.1 A mixture of four normal distributions 

To see and visualize the properties of our adap- 
tive PT algorithm, we chose a mixture of four normal 
distributions in two dimensional space as the target 
distribution (Fig. QJa)): 

g( x ) = 

4 . . 

g L^detW' 2 CXP T 2^ - - *)) , 

where the parameters of the normal distributions are 

Mi = (0,44), /is = (44,0), 

pa = (0,-44), M4 = (0,-44), 

Ex =diag(l,7 2 ), E 2 =diag(7 2 ,l), 

E 3 = diag(l,7 2 ), E 4 = diag(7 2 ,l). 

These normal distributions have quite different vari- 
ances (1 and 7 2 ), where the proposal variance learning 
is difficult. 

The adaptive PT algorithm ran for 3 x 10 5 itera- 
tions, where the auxiliary distributions are n tl (x) cx 
g(x) tl and the initial proposal variances are 7y = 
3 x 10 2 . 

As a result, our algorithm mixed well and obtained 
samples from all possible modes (Fig. QJb)). In fact, 
the number of inverse temperatures was reduced to 
five after 3 x 10 4 iterations but the auxiliary distribu- 
tion TTf (dx) is flat enough (Fig. [2]), where t$ is the t& 
obtained by the adaptive PT algorithm. 

The larger the variances of the auxiliary distribu- 
tion became, the larger the proposal variances became. 
In fact, the sums of proposal variances are (71.1 + 
71,2, ■ • • , 75,1 + 75,2) = (32.26, 41.86, 245.8, 1124, 8704). 

The estimated exchange ratios converged to 
(0.501, 0.507, 0.499, 0.498), all of which are almost a = 
0.5. Then, the inverse temperatures were (t,2, ■ ■ ■ , is) = 
(0.328, 0.108, 0.0307, 0.00937). 

trj^ and 72 converge quickly from even the ex- 
treme starting points (Fig. The others also con- 
verge as fast as them. 

4.2 The posterior of a mixture of six normal 
distributions 

In the second experiment, we estimated the average 
of component specific means of given the data using 
Bayesian estimation as is seen in Jasra et al. (2007). 



(a) (b) 
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Figure 1: A mixture of four normal distributions, a) The target distribution, b) Samples by the Adaptive PT 
algorithm. 
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Figure 3: trace plot (a):proposal variance , (b):inverse temperature t^' . 
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Figure 2: Samples by Metropolis algorithm with 75 
from the auxiliary distribution n^^dx). They cover 
all the modes in Fig. Q] 

The statistical model was a mixture of normal distri- 
butions, that is, 

f(y\n,w,<r 2 ) 

=£vs: exp ("M: (y " Mro)2 J' (5) 

where wm = 1 — Yl^—i Wm ' ^ ne P r i° rs are a normal- 
inverse Gamma-Dirichlet prior as follows. 

^ m ~ A(£, k 2 ), m = f,...,M, 
ff m ~ IG(a g , P g ), m = l,...,M, 
w m ~V(g), m=l, ...,M-1, 

where T>(g) is the symmetric Dirichlet distribution 
with parameter g. In the following, the hyper- 
parameters were a g — 12, f3 g — 10 and g — 1. The 
parameters £ and k 2 were determined by the median 
and four times the variance of the given data. 

The data of size 150, yi : i5o, were independently 
identically distributed according to a mixture model 
of the form with parameters, M = 6, w\ = • • • = 
w 6 = 1/6, (mi,..., Me) = (-8,-3,1,4,8,13), a 2 = 
erf = 1.5 2 and <r 2 = • • ■ = erf = 0.5 2 . In this case, 
the posterior 7r(/z, w, a 1 1 j/i : i5o) was a 17 dimensional 
distribution and had 6! = 720 symmetric modes due 
to the invariance against permutation of the labels of 
the parameters. 

The auxiliary distributions were set to 

/150 \ tl 

7r ti (M,w,CT 2 |?/i : i5o) cx I ~\\f{yi\n,w,(j 2 ) J p(^,w,ct 2 ), 

where p(/x,u>,<7 2 ) was the prior. 

Our algorithm was compared to the conventional 
PT algorithm with the fixed parameters. The param- 
eter values of the conventional algorithm were shifted 
from the values obtained by the adaptive PT algorithm 
as follows. 



(a) Ci <- 6 ■ <P(, far J = 1, ... . (0-5 < ^ < 3)- 
L «- £, 7j «- 7j. 

(b) 7, <- 7; • <^ 2 , for Z = 1, . . . , L, (0.1 < ^ 7 < 3). 
L <- L, d <- Q. 

(c) L <- L + tp L , (-5 < </?l < 5). 
7( <- % Ci <- Ci- 

(If ipL > 0, Ci and 7/ were learned, for Z = L + 
1, L + tpi,, to maintain the fairness.) 

We ran the adaptive PT algorithm and the con- 
ventional PT algorithms for 10 6 iterations respec- 
tively. The initial condition were wf^L — 1/6, c 2 ^ ~ 
IG(a g ,/3 g ), ^,ra ~ C/[min(j/i : i 50 ),max(yi : i5o)] for 

each run. The initial values of "f[°\ ■ ■ ■ , 7^ were the 
sorted L random numbers from [/[0. 0001, 800]. The 
variables of posterior were divided into four blocks, 
the numbers of which are (5,4,4,4). Each Metropolis 
algorithm updated for the every block. 

We evaluated the estimators of fj, m , m = 1, . . . , 6. 
The accuracy of the estimation was evaluated by the 
root mean square error (RMSE). To evaluate the to- 
tal error of the six estimators, RMSE takes the root 
average of the errors of the six estimators, that is, 

/ 6 XV2 

RMSE(*)= - ^(/2 m (*)-2.5) 2 , 

\ m=l / 

where jl m (i) is the estimator, which is the average of 
samples in the ith trial, and 2.5 is the true value. 

As a result, our algorithm can obtain appropriate 
parameters and achieve very low RMSEs. Fig. IDJa) 
and (b) show the RMSEs of the adaptive PT algorithm 
and the conventional PT algorithms in 50 runs were 
less than those of the conventional PT algorithms with 
shifted parameters. On the other hand, even if the 
number of temperatures increases the RMSEs don't 
increase (Fig.[4tc)), but the computational costs of the 
algorithms increase. In fact, the shifted inverse tem- 
peratures could not control the exchange ratios well 
(Fig.®. 

5 Generalization to Auxiliary Variable Meth- 
ods 

The idea of the adaptive PT algorithm is applica- 
ble to general AVMs. AVMs are mathematically for- 
mulated as below. 

Let n(dx) be a distribution on a state space X 
with a- algebra Fx and n\(dy\x) be a conditional dis- 
tribution on a state space y with er-algebra Fy given 
Fx, where A € A is a parameter vector. Then, the 
marginal distribution on X of the joint distribution 
■K\(dx,dy) = Tt\(dy\x)Tt(dx) is Tt(dx) irrespective of 
wx(dy\x). 

In case of MCMC methods with auxiliary vari- 
ables, Tr(dx) corresponds to the target distribution 
and ir\(dy\x) to the auxiliary distributions. We term 
an MCMC method that draw samples (x',y') from 
■K\(dx,dy) to obtain x' an auxiliary variable method. 
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Figure 4: RMSEs in 50 runs. Each plot displays an average and a standard error of RMSEs for each algorithm 
in a mark and a radius of the error bar, respectively. The adaptive PT algorithm : (o). The conventional PT 
algorithms with the shifted parameters are separately plotted in these figures. The inverse temperatures : x in 
(a), The proposal variances : o in (b), The number of inverse temperatures : A in (c). 




distribution must converge. These conditions will con- 
siderably restrict the available learning algorithms. 

In this section, we show some convergence theo- 
rems that our algorithm in the previous section con- 
verges under weaker conditions. Here, convergence 
means that an algorithm is ergodic, that is, 

lim \\A( n \(x,y,d),dx) -n(dx)\\ = 0, 

n— too 

V(a;, y) G X x y, 9 G 9, 
where \\pt,(dx) — u(dx)\\ = sup AeJ r x \n{A) — u(A)\ and 
A^((x,y,9),B x ) 



P 



>' £B X \ X W =x,yl (S > =y,6V> 



B x e F, 



x- 



Pairs of auxiliary distributions 

Figure 5: The averages of the estimated exchange ra- 
tios in 50 runs. The ratios of the adaptive PT al- 
gorithm were almost the predefined value while the 
conventional PT algorithm with shifted inverse tem- 
peratures could not be tuned constant and a moderate 
value (0.5), a dotted line. 



In the PT algorithms, for example, the auxiliary distri- 
butions are n\(dy\x) = 11^=2 n t t (dxi), A = (t 2 , . . . ,t L ) 
and the auxiliary variables y — (x 2 , ...,£/,). 

In order to introduce adaptation, we need 
to consider time-varying parameters. Let 
{Pg((x, y), (dx, dy))}e^e be a family of Markov 
transition kernels on X x y with stationary distribu- 
tion Tr\(dx,dy), that is, 



Theorem 1. The adaptive MCMC for AVM is ergodic 
if the following conditions hold: 

(a) Simultaneous uniform ergodicity 

Ve > 0, 3N e N s.t. 

\\P e N ((x,y),dx)-?r(dx)\\<e, 

v{x, y ) g x x y,e e e. 



(6) 



(b) Diminishing adaptation 



lim^oo sup 

(x, y )exxy 

\\Pg(,i+D ((x,y), (dx,dy)) - P gM ((x,y), (dx,dy)) 
= in probability. (7) 



Proof. See Appendix |A"1 



□ 



(n x P e ) {dx,dy) = / / n x (dx' ,dy')P e ((x\y'), (dx,dy)) 
= ir x {dx,dy), 

where A C 6. Then, the adaptive MCMC for AVM 
updates the parameters 9 during generating chains 
(x( n \y( n ') by Pg as the following pseudo code. 

Algorithm Adaptive MCMC for AVM 

Initialize (x^°\y^), 9 <°>. 
for n = to N - 1 do 

[1] (x( n +V, y( n +V) ~ P flW ((iW , (di, 
[2] Update 0M to 6>(" +1 ) by using the result of 
step 1. such as (a;( n+1 ), y {n +^). 
end for 

In the adaptive PT algorithms, for exam- 
ple, the time-varying parameter vector is 9 = 
(71, ...,7z,;*2, 

6 Convergence Theorem 

Atchade (2011) and Fort et al. (2011) proved con- 
vergence theorems of adaptive MCMC algorithms that 
learn the parameters of the target distribution. The 
conditions for convergence in their theorems are, how- 
ever, technical and strict. For example, the stationary 



The above conditions do not require that the aux- 
iliary parameter A^™^ and the stationary distribution 
7r A („) converge. The condition (a) can be replaced with 
more concrete condition that checks only properties of 
the Markov transition kernel as follows. 

(a') (Simultaneously strongly aperiodically geometri- 
cal ergodicity) There exists C G J-xxy, V : 
X x y ->• [l,oo) , S > 0, r < 1, and b < 00, 
such that sup c U < 00 and the following condi- 
tions hold for all 9 G 0. 

(i) (Strongly aperiodic minorisation condition) 

There exist a probability mea- 
sure u$ {dx, dy) on C such that 
Pg((x,y),(dx' ,dy')) > Sug(dx',dy') for 
all x,y G C. 

(ii) (Geometric drift condition) 

(P e V) (x,y) < rV(x,y)+bl {c} (x,y), 

for all x,y G X x y, 

where (PgV)(x,y) 

= JJ Pg ((x, y), (dx', dy')) V(x', y')dx'dy\ 
and 

\i.\(x) is the indicator function. 



Theorem 2. The adaptive MCMC for AVM is ergodic 
if the condition (b) in Theorem 1, the condition (a 7 ) 
and E{V(x {0 \y^)\ < oo hold. 

Proof. Straightforward from Proposition 3 and the 
proof of Theorem 3 in Roberts and Rosenthal (2007), 
and Theorem 1. □ 

Theorem 3 (Weak law of large numbers). Suppose 
an adaptive MCMC for AVM satisfies the conditions 
(a) and (b) and let g : X — » M be a bounded measurable 
function. Then, 

1 - f 

— ~^^g(x^) — > l g(x)n(dx) in probability 

i—l 

as n —¥ oo for any initial values {x, y) G X x y and 

Proof. Straightforward from the coupling argument 
(Roberts and Rosenthal 2007). □ 

The convergence of the adaptive PT algorithm is 
proved by applying Theorem 2 as below. 

Theorem 4. The adaptive PT algorithm is ergodic if 
the following conditions hold: 

(si) The support S of the target distribution n(dx) is 
compact and the density ir(x) is continuous and 
positive on S . 

(s2) The family of proposal densities {<7 7 } 7 erp is con- 
tinuous and positive on S 2 xT p , where T = [c, C] . 

Proof. See Appendix [Bl □ 

It will be possible to remove the assumption that 
S is compact by extending Theorem 6 of Bai et 
al. (2011). 

7 Conclusions 

This paper proposed adaptive MCMC algorithms 
to learn parameters of proposal distributions and aux- 
iliary distributions simultaneously, and proved conver- 
gence theorems that give weak sufficient conditions for 
convergence. 

We applied this framework to the Parallel Tem- 
pering algorithm and showed that the adaptive PT 
algorithm can adapts its parameters on the fly so that 
samples mix rapidly by experiments with a mixture 
model. We also presented that the performance of 
the PT algorithm depends on its parameters and the 
adaptive PT algorithm finds good parameters through 
experiments for Bayesian estimation. 

Although we discussed the PT algorithm in a real 
space so far, we consider the idea of adaptation is ap- 
plicable to those in a discrete space. We also consider 
our adaptive framework is applicable to other auxil- 
iary variable methods such as the Gibbs variable selec- 
tion, the partial decoupling method (one of the cluster 
Monte Carlo methods) and so on. We will extend our 
theory to the new field in the future. 



Appendix 

A Proof of Theorem 1 

Let e > 0, and choose N G N as in condition (a). 
From condition (b) and the coupling argument in the 
proof of Theorem 1 of Roberts and Rosenthal (2007), 
the following result holds. 

There exists n* G N large enough so that 
for K > n* + N, there exists a second chain 
{x /( * n \y^}K =K _ N , such that (a^*-*),^*-*)) = 
y(*-">) and 

( x /(«+i) j2/ /(n+i)) _ P e{K - N) ((x' {n \y'^),dx,dy) for 
n = K — N, K — 1, and P(x^ ^ x' (K ^) < It. 
Then it follows that 

\\P{x^ K) G da) - P{x' (K) edx)\\< 2e, (8) 

where P(x^ G dx) denotes the distribution of x^ K \ 
(Because of \\P(y £ dx) - P(z G dx)\\ < P(y ^ z).) 

On the other hand, from the condition (a), for all 
Ax G Fx, we have 



e > 



E[P e ^ N) ((x( K - N \y( K - N ^A x )-n(A x )] 
P{x'^ G A x ) - n(A x ) 



That is, 

\\P{x' {K ^ e dx) - ir(dx)\\ <e. (9) 
From inequality ([5} and ©, we have 

||P(xW G dx) - 7r(ds)|| < 3e. (10) 
Since K > n*+N is arbitrary, the algorithm is ergodic. 

B Proof of Theorem 4 

We prove the sufficient conditions of convergence 
in Theorem 2 are satisfied. Firstly, we prove the con- 
dition (a') holds. 

Let Borel cr-algebra on W be B(W). For x E S L , 
7 G TP L , t G T L and B = B 1 x B 2 x • ■ • x B L , B x G 
B(S), the transition kernel of the PT algorithm is 

L L 

K~ t (x,B) = a r JJP 7l)tl (af|,£fj) + (1 - a r ) ^ Qk U -i(x, B), 



i=i 



1=2 



(11) 



where < q < 1, J2i=2^ = 1' p nA( x i^ dx i) an d 
ki : i^i(x,dx') are the Metropolis transition kernel for 
7Tfj (dxi) and the transition kernel of an exchange pro- 
cess of xi and xi-i, respectively. 

By condition (si), we have d = sup 2 , 6S tg7 -7rt(x) < 
oo. By the compactness of S and condition (s2), we 
have also S = ini XtX 'es,-yerp q-y(x,x') > 0. 

For x G S and t G T, denote R X:t = 
{y G S\^§ < l}. For Xl G S, B t G B(S), U G T 



and 7; S r, we have 
Ry litl (xi,Bi) 



g 7i (x /; xj)min ^1, 



dxi 



+ 1 {B l }(xi) / q 7i (x[,X[) < 1 - min I 1, 



q ll (xi,x[) 1 ^T^ldx', 



dxi 



B t nR% 



> 



7r ti {x'^dxl + - 



ntiix'^dx'i 



From Eq. (TTTT) . this inequality leads to 

K ltt {x,B)>a r \{P^t l {x h B l ) 



L 



i=i 
5 L 



(12) 



where ^t{B) = \\ l=1 ir tl (Bi) is a probability measure 
on S L . Since the inequality (fT2j) holds for all J3 G 
B(S L ), the condition (a')(i) follows. 

Let < t < 1, V(a;) = 1 if x G 5' L , otherwise 
V(a:) = 1/r, and 6 = 1 — r. Then we have 



(K l t V)(x) < tV(x) + W^W, G 



(13) 



This inequality implies that the condition (a')(ii) is 
satisfied. Also we have E[V{x^°\ yW)] < 1/r < oo. 



From Eq. © and it follows that i 



(n+l) An) 



a.s. and 7; — 7^" ; — > as n — > oo. The mini- 
mum inverse temperature decision process changes the 
value of q only finite times. Thus, the condition (b) 
in Theorem 1 holds. 

The proof is complete. 
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